Characterization of the Complete Chloroplast Genome of Apple (Malus × domestica, Rosaceae)
Abstract: Apple (Malus × domestica) is one of the most important temperate fruits. To better understand the molecular basis of this species, we characterized its complete chloroplast (cp) genome sequence, downloaded from the Genome Database for Rosaceae. The cp genome of apple is a circular molecule of 160 068 bp with a typical quadripartite structure: two inverted repeats (IRs) of 26 352 bp each, separated by a small single-copy (SSC) region of 19 180 bp and a large single-copy (LSC) region of 88 184 bp. A total of 135 predicted genes were identified (115 unique genes, with another 20 duplicated in the IRs), including 81 protein-coding genes, four rRNA genes and 30 tRNA genes. Three genes, ycf15, ycf68 and infA, contain several internal stop codons and were interpreted as pseudogenes. The genome structure, gene order, GC content and codon usage of apple are similar to those of typical angiosperm cp genomes. Thirty repeat regions (≥30 bp) were detected, of which twenty-one are tandem, six are forward and three are inverted repeats. Two hundred thirty-seven simple sequence repeat (SSR) loci were revealed, most of them composed of A or T, contributing to a distinct bias in base composition. Additionally, non-coding regions contain on average 24 SSR sites per 10 000 bp, whereas protein-coding regions contain five, indicating an uneven distribution of SSRs. The complete cp genome sequence of apple reported in this paper will facilitate future studies of its population genetics, phylogenetics and chloroplast genetic engineering.


Key words: Apple; Chloroplast genome; Repeat analysis; SSRs 


Apple, Malus × domestica Borkh., belongs to the tribe Pyreae of Rosaceae (Potter et al., 2007) and is cultivated all over the world except in tundra climates and the arctic regions. It is one of the oldest and most economically important temperate fruits. Globally, there are more than 7 500 known cultivars of apple, covering a range of desired characteristics. According to data from the Food and Agriculture Organization of the United Nations, total apple production in 2010 was about 69 million tons, and the overall area of apple plantations was 5.62 million hectares (www.fao.org). Apple is considered to have high economic value, but the species is highly susceptible to a number of fungal and bacterial diseases and insect pests, which annually reduce the harvest by 12% to 25%. However, the introduction or deletion of target genes by conventional hybridization is generally costly, inefficient and slow because of the high heterozygosity and long juvenile period of apple plants.

Chloroplasts (cp), considered to have originated from cyanobacteria through endosymbiosis, are the photosynthetic organelles that provide essential energy for plants and algae (Howe et al., 2003; Gray, 1989). This intracellular organelle encodes a number of chloroplast-specific components and is involved in major functions such as sugar synthesis, starch storage, the production of several amino acids, lipids, vitamins and pigments, and key sulfur and nitrogen metabolic pathways (Martin et al., 2013). Earlier studies based on restriction site mapping demonstrated that the gene content, gene order and genome organization of cp genomes are largely conserved within land plants (Raubeson and Jansen, 2005; Palmer, 1991). However, with the increasing number of whole cp genomes available, many structural rearrangements, large IR expansions/contractions and gene losses have been found (Chumley et al., 2006; Millen et al., 2001; Guisinger et al., 2010). These events, coupled with the sequences per se, provide sufficient information for genome-wide evolutionary studies. The cp genome has shown great potential for resolving phylogenetic questions at both high and low taxonomic levels, and complete cp genome sequences are sometimes necessary for resolving complex evolutionary relationships (Givnish et al., 2010; Downie and Palmer, 1992; Jansen et al., 2007). Meanwhile, comparative analysis of cp genomes from distant and closely related species will facilitate the association of important traits controlled by plastid genomes (Liu et al., 2013).

Velasco et al. (2010) reported a high-quality draft genome sequence of apple and reconstructed the phylogeny of the genus Malus using 23 nuclear genes; the progenitor of the cultivated apple was identified as M. sieversii. Compared with the nuclear genome, our understanding of apple's cp genome lags behind, although the complete chloroplast genome of apple was released along with the nuclear genome sequence (Velasco et al., 2010; http://www.rosaceae.org/projects/apple_genome). In this article, we annotated the cp genome of apple in detail. In addition, we determined the distribution and location of microsatellites (SSRs) and repeats in the apple cp genome. The cp genome information obtained will be useful for population genetics and breeding programs of apple.


1 Materials and methods 


1.1 Genome annotation 
Velasco et al. (2010) assembled the Malus × domestica cp genome sequence with 847× coverage. This high-quality cp genome sequence can be downloaded from the Genome Database for Rosaceae (GDR; http://www.rosaceae.org/projects/apple_genome). The cp genome was annotated using the program DOGMA (Wyman et al., 2004), coupled with manual corrections of start and stop codons and intron/exon boundaries. The tRNA genes were identified using DOGMA and tRNAscan-SE (Schattner et al., 2005). Codon usage was analyzed using a VB script. The circular cp genome map was drawn using the OGDRAW program (Lohse et al., 2007).
1.2 Repeat analysis 

REPuter (Kurtz et al., 2001) was used to identify both forward and inverted repeats. The minimal repeat size was set to 30 bp and the identity of repeats was no less than 90% (Hamming distance equal to 3). Tandem repeats were analyzed using Tandem Repeats Finder (TRF) v4.04 (Benson, 1999) with parameter settings as described by Nie et al. (2012). Overlapping repeats were merged into one repeat motif whenever possible. A given region in the genome was designated as only one repeat type, and the tandem designation took priority when a repeat motif could be classified as both a tandem repeat and another type.
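
For a 30 bp repeat, the 90% identity threshold and the Hamming distance of 3 express the same cutoff. The following minimal Python sketch illustrates this relationship on made-up sequences; the function names are hypothetical helpers written for illustration and are not part of REPuter.

```python
# Illustrative helpers only; the names are hypothetical and not part of REPuter.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Identity (%) of two equal-length sequences."""
    return 100.0 * (len(a) - hamming_distance(a, b)) / len(a)


def is_repeat_pair(a: str, b: str, inverted: bool = False,
                   min_len: int = 30, max_mismatches: int = 3) -> bool:
    """Apply the length and mismatch cutoffs to one candidate pair of copies."""
    if inverted:
        # Inverted repeats match on the opposite strand.
        b = reverse_complement(b)
    if len(a) != len(b) or len(a) < min_len:
        return False
    return hamming_distance(a, b) <= max_mismatches


# Toy 30 bp copy and a second copy carrying three substitutions (made-up data).
copy_a = "ACGT" * 7 + "AC"
bases = list(copy_a)
bases[0], bases[10], bases[20] = "T", "C", "G"
copy_b = "".join(bases)

print(hamming_distance(copy_a, copy_b), percent_identity(copy_a, copy_b))
```

Three mismatches in 30 bp leave 27 matching positions, that is, 90% identity, so the toy pair sits exactly on the cutoff.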
1.3 SSR analysis 

We detected SSRs of 8 bp or longer in the apple cp genome. This threshold was set because SSRs of 8 bp or longer are prone to slip-strand mispairing, which is thought to be the primary mutational mechanism causing their high level of polymorphism (Huotari and Korpelainen, 2012; Raubeson et al., 2007; Rose and Falush, 1998). Microsatellites (mono-, di-, tri-, tetra-, penta- and hexanucleotide repeats) were detected using MISA (Thiel et al., 2003) with minimum repeat numbers of 8, 4, 4, 3, 3 and 3 for unit sizes of 1, 2, 3, 4, 5 and 6 nucleotides, respectively. The SSR analysis considered only one inverted repeat region (IRb). All of the repeats found were manually verified, and redundant results were removed.
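
The thresholds above can be illustrated with a short Python sketch. It is not MISA itself: the scanning logic, the names `MIN_REPEATS`, `SSR`, `is_primitive` and `find_ssrs`, and the toy sequence are assumptions made for illustration. Only the unit sizes and minimum repeat numbers are taken from the text. Overlaps between hits of different unit sizes are not resolved here, which is one reason manual verification would still be needed.

```python
from typing import Dict, List, NamedTuple, Optional

# Minimum number of repeats per unit size, as stated in the text.
MIN_REPEATS: Dict[int, int] = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


class SSR(NamedTuple):
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    unit: str
    repeats: int


def is_primitive(unit: str) -> bool:
    """True if the unit is plain ACGT and not a repetition of a shorter motif."""
    if not unit or set(unit) - set("ACGT"):
        return False
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq: str, min_repeats: Optional[Dict[int, int]] = None) -> List[SSR]:
    """Scan a sequence for perfect SSRs that meet the per-unit-size thresholds."""
    thresholds = MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits: List[SSR] = []
    for k, n_min in sorted(thresholds.items()):
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            if not is_primitive(unit):
                i += 1
                continue
            n = 1
            # Extend while the next k bases repeat the unit exactly.
            while seq[i + n * k:i + (n + 1) * k] == unit:
                n += 1
            if n >= n_min:
                hits.append(SSR(i + 1, i + n * k, unit, n))
                i += n * k
            else:
                i += 1
    return sorted(hits)


# Made-up test sequence containing a poly(A), an (AT)5 and an (AGCT)3 repeat.
toy = "GC" + "A" * 10 + "CGG" + "AT" * 5 + "CCG" + "AGCT" * 3 + "G"
for hit in find_ssrs(toy):
    print(hit)
```

With these settings the shortest reportable mono- and dinucleotide SSRs are both 8 bp long, whereas tri- to hexanucleotide SSRs are at least 12 bp.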


2 Results 
2.1 Genome organization 

The complete cp genome of apple is a circular DNA molecule of 160 068 bp with a quadripartite structure typical of the majority of land plant chloroplast chromosomes. It is the largest cp genome among the five Rosaceae species compared (Table 1). The cp genome harbors a pair of identical inverted repeat regions (IRa and IRb) of 26 352 bp each. The inverted repeats are separated by the large (LSC) and small (SSC) single-copy regions of 88 184 and 19 180 bp, respectively (Table 1, Fig. 1). The IRs span from rps19 to a portion of ycf1. The overall GC content of the apple cp genome is 36.5%; it is 42.7% within the inverted repeat regions, and 34.2% and 30.4% within the LSC and SSC, respectively (Table 2). The high GC content of the IRs is caused by the four GC-rich rRNA genes (with an average GC content of 55.5%).
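
The regional figures can be cross-checked with simple arithmetic: the four regions should add up to the genome length, and the length-weighted mean of the regional GC contents should reproduce the overall value. The Python sketch below does this with the numbers reported above; the function and variable names are illustrative assumptions, and `gc_percent` is included only to show how a GC content would be computed from a raw sequence.

```python
# Region lengths (bp) and GC contents (%) as reported in the text and Table 2.
REGIONS = {
    "LSC": (88184, 34.2),
    "SSC": (19180, 30.4),
    "IRa": (26352, 42.7),
    "IRb": (26352, 42.7),
}


def gc_percent(seq: str) -> float:
    """GC content (%) of a DNA string."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def genome_length(regions: dict) -> int:
    """Total length of all regions in bp."""
    return sum(length for length, _ in regions.values())


def weighted_gc(regions: dict) -> float:
    """Length-weighted mean GC content (%) over all regions."""
    total = genome_length(regions)
    return sum(length * gc for length, gc in regions.values()) / total


print(genome_length(REGIONS))          # sum of LSC, SSC, IRa and IRb
print(round(weighted_gc(REGIONS), 1))  # length-weighted GC content
```

The region lengths sum to 160 068 bp, and the weighted mean rounds to 36.5%, in agreement with the overall GC content given in the text.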
2.2 Gene content 

The positions of all the genes identified in the 
apple cp genome and category-wise distribution of 
these genes are presented in Figure 1 and Table 3. 
The apple cp genome encodes 135 predicted genes, 


of which 115 are unique. The unique genes include 





Table 1 Summary of the Rosaceae cp genome features

Species                       GenBank accession  Genome size/bp  LSC length/bp  IRa length/bp  SSC length/bp  Reference
Fragaria vesca subsp. vesca   NC_015206          155691          85606          25555          18175          Shulaev et al., 2011
Pentactina rupicola           NC_016921          156612          84970          26350          19237          Lee and Hong, 2011
Prunus persica                NC_014697          157790          85969          26381          19060          Jansen et al., 2011
Pyrus pyrifolia               NC_015996          159922          87901          26392          19237          Terakami et al., 2012
Malus × domestica             -                  160068          88184          26352          19180          This study

81 protein-coding, 30 tRNA and four rRNA genes (Table 3). Nine protein-coding, seven tRNA and all four rRNA genes are duplicated in the IR regions. Protein-coding genes, tRNAs and rRNAs make up


Fig. 1 Map of the apple cp genome (Malus × domestica chloroplast genome, 160,068 bp)
The thick lines indicate the extent of the IRs (IRa and IRb), which separate the genome into SSC and LSC regions. Genes lying outside the map are transcribed clockwise, whereas genes inside are transcribed counterclockwise. Genes belonging to different functional groups are color coded. The darker gray area in the inner circle indicates GC content, while the lighter gray corresponds to the AT content of the genome


Table 2 Base composition in the apple chloroplast genome

Genome feature            A/%    T(U)/%  G/%    C/%    Length/bp
LSC                       32.2   33.6    16.6   17.6   88 184
SSC                       34.8   34.8    14.5   15.9   19 180
IRa                       28.6   28.7    20.6   22.1   26 352
IRb                       28.7   28.6    22.1   20.6   26 352
Total                     35.1   32.1    17.9   18.6   160 068
CDS                       30.9   31.4    20.1   17.5   79 650
CDS, 1st codon position   31.0   23.8    26.7   18.5   26 550
CDS, 2nd codon position   29.6   32.5    17.8   20.1   26 550
CDS, 3rd codon position   32.2   38.0    15.9   13.9   26 550

CDS: coding DNA sequence
47.9%, 1.7% and 5.4% of the genome, respectively, while introns and intergenic spacers constitute the remaining 45.0%. The LSC region contains 61 protein-coding genes and 22 tRNA genes, whereas the SSC region contains 11 protein-coding genes and one tRNA gene. Eighteen genes in the apple cp genome contain introns, three of which (clpP, rps12 and ycf3) contain two introns (Table 4). The trnK-UUU gene has the largest intron (2 516 bp), within which another gene, matK, is nested. For the rps12 gene, the 5' exon is located in the LSC region, and the 3' exon is located in the IR regions. The ycf1 and rps19 genes are located in the boundary regions between IRb/SSC and IRa/LSC, respectively. Incomplete duplications of the normal copies of ycf1 and rps19 at these boundaries have resulted in a lack of protein-coding ability. The psbD-psbC and ycf1-ndhF pairs are two cases of overlapping genes.

2.3 Codon usage 


Based on the sequences of protein-coding genes 


Table 3 Genes present in the apple chloroplast genome

Group of genes                 Gene names
Photosystem I                  psaA, psaB, psaC, psaI, psaJ
Photosystem II                 psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ
Cytochrome b/f complex         petA, petB*, petD*, petG, petL, petN
ATP synthase                   atpA, atpB, atpE, atpF*, atpH, atpI
NADH dehydrogenase             ndhA*, ndhB* (×2), ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK
RubisCO large subunit          rbcL
RNA polymerase                 rpoA, rpoB, rpoC1*, rpoC2
Ribosomal proteins (SSU)       rps2, rps3, rps4, rps7 (×2), rps8, rps11, rps12** (×2), rps14, rps15, rps16*, rps18, rps19
Ribosomal proteins (LSU)       rpl2* (×2), rpl14, rpl16*, rpl20, rpl22, rpl23 (×2), rpl32, rpl33, rpl36
Other genes                    clpP**, matK, accD, ccsA, infA, cemA
Proteins of unknown function   ycf1, ycf2 (×2), ycf3**, ycf4, ycf15 (×2), ycf68 (×2)
Transfer RNAs                  37 tRNAs (6 contain an intron, 7 in the IRs)
Ribosomal RNAs                 rrn4.5 (×2), rrn5 (×2), rrn16 (×2), rrn23 (×2)

One or two asterisks after a gene name indicate that the gene contains one or two introns, respectively



Table 4 The genes with introns in the apple cp genome and the length of the exons and introns

Gene       Location  Exon I/bp  Intron I/bp  Exon II/bp  Intron II/bp  Exon III/bp
atpF       LSC       411        733          144
clpP       LSC       228        650          291         824           69
ndhA       SSC       540        1141         552
ndhB       IR        756        670          777
petB       LSC       6          798          642
petD       LSC       9          725          474
rpl16      LSC       399        989          9
rpl2       IR        435        687          390
rpoC1      LSC       1611       742          435
rps12*     LSC       114        -            26          542           231
rps16      LSC       231        860          42
trnA-UGC   IR        38         808          35
trnG-GCC   LSC       23         707          37
trnI-GAU   IR        42         944          35
trnK-UUU   LSC       35         2516         37
trnL-UAA   LSC       37         515          50
trnV-UAC   LSC       37         593          39
ycf3       LSC       153        745          228         709           126

* The rps12 gene is trans-spliced, with the 5' end located in the LSC region and the duplicated 3' end in the IR regions
and tRNA genes within the chloroplast genome, the relative synonymous codon usage (RSCU) (Sharp and Li, 1986) was deduced for the apple cp genome and is summarized in Supplementary Table 1. The codon usage of the apple chloroplast genome strongly reflects the AT bias. Within the coding DNA sequence (CDS), the AT contents of the first, second and third codon positions are 54.8%, 62.1% and 70.2%, respectively (Table 2). Moreover, the 81 protein-coding genes comprise 79 650 bp, coding for 26 550 codons. Among these codons, 2 781 (10.5%) encode leucine and 307 (1.1%) encode cysteine, which are the most and least prevalent amino acids, respectively. The most frequently used codon is ATT, encoding isoleucine (Ile). High codon usage was also observed for lysine (Lys) and glutamic acid (Glu) (Supplementary Table 1). Instead of the common ATG start codon, we identified GTG as the start codon for rps19. All three stop codons are present, with UAA being the most frequently used (UAA 58.8%, UAG 23.3% and UGA 17.8%).
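
As a worked illustration of the RSCU measure, the value for a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid, so that values above 1 indicate over-represented codons. The Python sketch below applies this definition to the phenylalanine and threonine counts listed in Supplementary Table 1; the function name `rscu` is an illustrative assumption and does not refer to the VB script used in this study.

```python
def rscu(codon_counts: dict) -> dict:
    """RSCU of each codon within one synonymous family.

    RSCU = observed count / mean count of the synonymous codons.
    """
    mean = sum(codon_counts.values()) / len(codon_counts)
    return {codon: n / mean for codon, n in codon_counts.items()}


# Codon counts taken from Supplementary Table 1.
PHE = {"UUU": 976, "UUC": 529}
THR = {"ACU": 545, "ACC": 253, "ACA": 422, "ACG": 151}

for family in (PHE, THR):
    for codon, value in rscu(family).items():
        print(codon, round(value, 2))
```

Within each family the RSCU values sum to the number of synonymous codons, which provides a quick consistency check on tabulated values.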
2.4 Non-functional genes 

The ycf15 gene employs GTG as its start codon and contains several internal stop codons, which indicates that it is most likely a non-functional gene. The
reading frame of this gene contains one inserted 'ACTA' unit, which causes a frameshift and the resulting internal stop codons (Fig. 2: A). The ycf68 gene, on the other hand, is a truncated pseudogene with accumulated stop codons in its reading frame, caused by the absence of one 'AAAC' unit and two deletion events (13 bp in total) (Fig. 2: B). We also found that the infA gene is probably non-functional in the apple chloroplast genome owing to several premature stop codons caused by the insertion of one 'TATC' unit (Fig. 2: C).
2.5 Repeat analysis 

For repeat structure analysis, we detected six forward (direct), three inverted and 21 tandem repeats in the apple cp genome (Supplementary Table 2). Most of these repeats are between 30 and 41 bp in length (Fig. 3: A). The longest repeat, of 91 bp, is located in the intergenic region between psbZ and trnG-UCC within the LSC. Tandem repeats, accounting for 70% of all repeats, are the most common of the three repeat types (Fig. 3: B). Most of the repeats (76%) are distributed within intergenic spacer regions, with 8% in introns, 8% in CDS regions and 8% in tRNA genes (Fig. 3: C).
2.6 SSR analysis 

Chloroplast simple sequence repeats (cpSSRs) of apple were examined and are listed in Supplementary Table 3, along with their nucleotide sequences and positions within the cp genome. We identified 237 SSR loci (≥8 bp) in total, of which 164 are mononucleotide, 68 dinucleotide, four tetranucleotide and one hexanucleotide repeats. Among these cpSSRs, the longest is a poly(T) of 26 bp. The majority of mononucleotide repeats are composed of A (64) or T (94), while only six are composed of G or C. Most cpSSRs are 8 to 10 bp long (62 of 8 bp, 39 of 9 bp and 21 of 10 bp), accounting for 51.48% (122/237) of all cpSSRs. CpSSRs are unevenly distributed across the genome: 175 in the LSC, 23 in the IRb and 39 in the SSC. Analysis by functional location revealed that 158 cpSSRs are located in intergenic spacer regions, 38 in introns and 41 in the CDS of 18 genes, 17 of which harbor at least two SSRs.
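
The counts reported in this section can be checked against one another, and the SSR density in coding regions can be estimated, with a few lines of Python. The sketch below is illustrative only: the variable and function names are assumptions, and the CDS length of 79 650 bp is taken from Table 2, which counts both IR copies whereas the SSR analysis considered only IRb, so the density is an approximation.

```python
# SSR counts as reported in the text.
SSR_BY_UNIT = {"mono": 164, "di": 68, "tetra": 4, "hexa": 1}
SSR_BY_REGION = {"LSC": 175, "IRb": 23, "SSC": 39}
SSR_BY_LOCATION = {"intergenic spacer": 158, "intron": 38, "CDS": 41}
SHORT_SSRS = {8: 62, 9: 39, 10: 21}   # SSR length (bp) -> count

CDS_LENGTH_BP = 79650                 # Table 2; includes both IR copies (approximation)


def share(part: int, total: int) -> float:
    """Percentage of the total represented by part."""
    return 100.0 * part / total


def per_10kb(count: int, length_bp: int) -> float:
    """Number of SSRs per 10 000 bp."""
    return 10000.0 * count / length_bp


total = sum(SSR_BY_UNIT.values())
print(total, sum(SSR_BY_REGION.values()), sum(SSR_BY_LOCATION.values()))
print(round(share(sum(SHORT_SSRS.values()), total), 2))
print(round(per_10kb(SSR_BY_LOCATION["CDS"], CDS_LENGTH_BP), 1))
```

All three breakdowns sum to 237, and the coding-region density of roughly five SSRs per 10 000 bp agrees with the figure quoted in the abstract.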


3 Discussion 
3.1 Genome Organization 

In general, the size of photosynthetic land plant plastid chromosomes ranges from 108 kb to 165 kb (Palmer, 1991; Raubeson and Jansen, 2005). The cp genome of apple is near the upper boundary of this range and is the largest of the five available Rosaceae cp genomes. It is about 0.1 kb, 2.2 kb, 3.4 kb and 4.3 kb larger than those of Pyrus pyrifolia, Prunus persica, Pentactina rupicola and Fragaria vesca subsp. vesca, respectively. The genome size variation is mainly caused by differences in the length of the SSC and IR regions (Table 1).

Fig. 2 Alignment of three pseudogenes
A. Alignment of the ycf15 gene and protein sequences with those of two representative angiosperm species [Nicotiana (NC_001879) and Atropa (NC_004561)]. Black asterisks indicate stop codons in the protein. Red arrows indicate the inserted 'TCTA' region in apple. B. Alignment of the ycf68 gene and protein sequences with those of two representative angiosperm species [Zea (NC_001666) and Oryza (NC_001320)]. Red arrows or boxes indicate the 'AAAC' unit missing in apple. C. Alignment of the infA gene and protein sequences with those of representative angiosperm species [Vitis (NC_007957) and Sesamum (NC_016433)]. Red arrows indicate the 'AGAT' unit missing in apple

Fig. 3 Repeat structure analysis in the apple cp genome
The cutoff value is 15 bp for tandem repeats and 30 bp for dispersed repeats. (A) Histogram showing the number of repeats in the apple chloroplast genome. (B) Composition of the 30 repeats. (C) Location of the 30 repeats


The apple cp genome is largely identical in gene order and content to most sequenced angiosperm cp genomes, emphasizing the highly conserved nature of land plant cp genomes (Wicke et al., 2011). Its GC content is in accordance with that of typical angiosperm cp genomes (Shinozaki et al., 1986; Kim and Lee, 2004; Hiratsuka et al., 1989; Sato et al., 1999; Terakami et al., 2012). The codon usage bias towards a higher AT representation at the third codon position has also been observed in other land plant cp genomes (Yang et al., 2010; Nie et al., 2012; Yi and Kim, 2012; Tangphatsornruang et al., 2010; Qian et al., 2013).

Three genes, ycf15, infA and ycf68, are non-functional in the apple cp genome. Both ycf15 and ycf68 contain four internal stop codons. These two pseudogenes have rarely been mentioned in previous studies (Ravi et al., 2007; Shi et al., 2013) and were not annotated in the other four reported Rosaceae cp genomes. The validity of ycf15 as a protein-coding gene has long been questioned (Chumley et al., 2006; Steane, 2005). Shi et al. (2013), however, suggested that in the Camellia transcriptome the ycf15 gene is transcribed as part of a precursor polycistronic transcript containing ycf2, ycf15 and antisense trnL-CAA. This gene is disabled in some angiosperms, such as Amborella (Goremykin et al., 2003) and Nuphar (Raubeson et al., 2007), monocots, most rosids and some other separate lineages (Shi et al., 2013). In apple, the imperfect ycf15 gene is probably a remnant of a gene that was functional in one of its ancestors. The ycf68 sequence, which occurs in the trnI-GAU intron, has been reported to be a functional protein-coding gene in rice, maize and Pinus (Raubeson et al., 2007). However, Raubeson et al. (2007) analyzed this gene in 14 angiosperms and found multiple frameshifts causing internal stop codons in most cases, which is confirmed again in apple. The infA gene, which codes for translation initiation factor 1, stands out as an unusually unstable angiosperm chloroplast gene: it has been lost from the chloroplast genome on many separate occasions, especially in eurosids, and transferred to the nucleus multiple times (Millen et al., 2001). Three eurosid taxa (Eucalyptus, Populus and Jatropha) contain infA, but it proved to be a pseudogene with multiple stop codons (Asif et al., 2010). Our results show the same for infA in apple. Why these three genes have degenerated in some land plant cp genomes deserves further study.

Most repeats are located in the intergenic spacers and introns, but several occur in tRNA genes and CDS. Short dispersed repeats are considered to be one of the major factors promoting cp genome recombination and rearrangement because they are common in highly rearranged algal and angiosperm genomes, and many rearrangement endpoints are associated with such repeats (Lee et al., 2007; Yue et al., 2007; Haberle et al., 2008; Pombert et al., 2005; Chumley et al., 2006). In un-rearranged cp genomes, most of the repeats are located in intergenic spacer regions and introns, although several are located in the protein-coding genes psaA, psaB and ycf2 (Daniell et al., 2006; Timme et al., 2007; Saski et al., 2005). Among the five available Rosaceae cp genomes, repeat analysis was carried out here for the first time, which will provide more informative sources for developing markers for population and phylogenetic studies.

In our study, we detected 237 SSRs with an uneven distribution in the apple cp genome. Most of the SSRs were found in non-coding regions, which is not unusual given the higher number of mutations within these regions compared with the more conserved coding regions (Ebert and Peakall, 2009). Additionally, there were considerably more A and T microsatellites than G and C microsatellites, as has been reported previously in other taxa (Kuang et al., 2011; Qian et al., 2013; Raubeson et al., 2007). SSRs are another repeat type, based on simpler motifs and shorter than the aforementioned repeats. SSRs have been used to obtain high resolution among closely related plant taxa, proving to be effective genetic markers for studying plant breeding, population genetics, biological conservation, mating systems and uniparental lineages (Terrab et al., 2006; Cardle et al., 2000; Peakall et al., 1998). By analyzing the complete chloroplast genome of apple, we hope to facilitate future studies by identifying target regions for more in-depth population studies within the genus.

3.2 Implications for Chloroplast Genetic Engineering

Chloroplast genetic engineering offers unique advantages, including the possibility of multi-gene engineering in a single transformation event, transgene containment due to maternal inheritance, high levels of transgene expression and lack of gene silencing (Daniell, 2007; Verma and Daniell, 2007; Verma et al., 2008). Foreign gene integration into the chloroplast genome occurs via homologous recombination of the flanking sequences used in chloroplast vectors (Verma and Daniell, 2007). Chloroplast transformation has made significant progress in the model species tobacco as well as in a few major crops, such as potato, tomato and cotton (Verma et al., 2008; Verma and Daniell, 2007). Although the trnI-trnA and accD-rbcL intergenic spacer regions have been widely used as gene introduction sites for vector construction (Verma et al., 2008), transformation efficiency is impaired when the sequences for homologous recombination are divergent among distantly related species (Ruhlman et al., 2006). Moreover, spacer regions are not 100% identical even among members of the same family. Comparison of intergenic spacer regions among members of Solanaceae revealed that only four regions are identical (Daniell et al., 2006). Similarly, comparison of the intergenic spacer regions of nine grass cp genomes revealed that not a single spacer region is identical among all sequenced cp genomes (Saski et al., 2007). Terakami et al. (2012) identified several deletions and insertions in intergenic spacer regions among the Pyrus, Malus and Prunus cp genomes, such as ndhC-trnV, trnR-atpA, rpl33-rps18, psbI-trnS and accD-psaI. No intergenic spacer region shows 100% identity across the available Rosaceae cp genomes. The availability of the complete cp genome sequence of apple will help to identify the optimal intergenic spacers for transgene integration and to develop site-specific cp transformation vectors. Using cp genetic engineering to introduce useful traits, such as pest resistance and drought tolerance, might be another application for improving this economically important plant.

Supplementary Table 1 Codon usage and codon-anticodon recognition pattern for tRNA in the apple cp genome

Amino acid  Codon  No.   RSCU  tRNA        Amino acid  Codon  No.   RSCU  tRNA
Phe         UUU    976   1.30  -           Ile         AUU    1114  1.46  -
Phe         UUC    529   0.70  trnF-GAA    Ile         AUC    436   0.57  trnI-GAU
Leu         UUA    901   1.94  trnL-UAA    Ile         AUA    732   0.96  trnI-CAU
Leu         UUG    567   1.22  trnL-CAA    Met         AUG    627   1.00  trnfM-CAU
Leu         CUU    585   1.26  -           Thr         ACU    545   1.59  -
Leu         CUC    186   0.40  -           Thr         ACC    253   0.74  trnT-GGU
Leu         CUA    362   0.78  trnL-UAG    Thr         ACA    422   1.23  trnT-UGU
Leu         CUG    180   0.39  -           Thr         ACG    151   0.44  -
Ser         UCU    573   1.68  -           Asn         AAU    990   1.53  -
Ser         UCC    331   0.97  trnS-GGA    Asn         AAC    305   0.47  trnN-GUU
Ser         UCA    409   1.20  trnS-UGA    Lys         AAA    1067  1.49  trnK-UUU
Ser         UCG    190   0.56  -           Lys         AAG    369   0.51  -
Ser         AGU    409   1.20  -           Val         GUU    521   1.45  -
Ser         AGC    138   0.40  trnS-GCU    Val         GUC    162   0.45  trnV-GAC
Tyr         UAU    796   1.61  -           Val         GUA    552   1.53  trnV-UAC
Tyr         UAC    194   0.39  trnY-GUA    Val         GUG    205   0.57  -
Cys         UGU    228   1.49  -           Ala         GCU    633   1.83  -
Cys         UGC    79    0.51  trnC-GCA    Ala         GCC    216   0.62  -
Trp         UGG    457   1.00  trnW-CCA    Ala         GCA    390   1.13  trnA-UGC
Pro         CCU    415   1.54  -           Ala         GCG    145   0.42  -
Pro         CCC    208   0.77  -           Asp         GAU    890   1.62  -
Pro         CCA    304   1.13  trnP-UGG    Asp         GAC    211   0.38  trnD-GUC
Pro         CCG    149   0.55  -           Glu         GAA    1036  1.49  trnE-UUC
His         CAU    494   1.55  -           Glu         GAG    358   0.51  -
His         CAC    145   0.45  trnH-GUG    Gly         GGU    581   1.31  -
Gln         CAA    721   1.53  trnQ-UUG    Gly         GGC    188   0.42  trnG-GCC
Gln         CAG    219   0.47  -           Gly         GGA    718   1.62  trnG-UCC
Arg         CGU    341   1.27  trnR-ACG    Gly         GGG    289   0.65  -
Arg         CGC    116   0.43  -           Stop        UAA    53    1.77  -
Arg         CGA    367   1.37  -           Stop        UAG    21    0.70  -
Arg         CGG    123   0.46  -           Stop        UGA    16    0.53  -
Arg         AGA    491   1.83  trnR-UCU
Arg         AGG    171   0.64  -

RSCU: relative synonymous codon usage. The most common codon for each amino acid is the one with the highest RSCU value

Supplementary Table 2 Repeated sequences in the apple chloroplast genome

No.  Size/bp  Type  Location                                          Repeat unit                            Region
1    31       F     trnS-GCU, trnS-UGA                                AAACGGAAAGAGAGGGATTCGAACCCTCGGTA       LSC
2    32       F     psaB (CDS), psaA (CDS)                            CGCAATAGCTA AATGATGATGAGCCATATCGGT     LSC
3    31       F     IGS (trnR-UCU, atpA), IGS (trnT-UGU, trnL-UAA)    ATAA ATATATTTTATATTCTAATATAT           LSC
4    30       F     IGS (ndhC, trnV-UAC)                              TTTTTTATTTTATATACTATATACATATACT        LSC
5    [illegible]  F  IGS (rps12 3' end, [illegible])                  [illegible]                            [illegible]
6    [illegible]  F  IGS ([illegible] end, trnV-GAC)                  [illegible]                            [illegible]
7    30       I     trnS-UGA, IGS (ycf3, trnS-GGA)                    AAAGGAGAGAGAGGGATTCGAACCCTCGATA        LSC
8    30       I     rpl16 (CDS), IGS (ndhF, rpl32)                    ATTCTTTTTTTTTTTTTTTTTTTATCTAAAA        LSC, SSC
9    40       I     ndhA (intron), IGS (trnV-GAC, rps12)              [illegible]                            SSC, IRa
10   37       T     IGS (trnK-UUU, matK)                              TTAATTTTTTGTTATCTC (×2)                LSC
11   33       T     IGS (rps16, trnQ-UUG)                             ATTATAGATTAATAAA (×2)                  LSC
12   43       T     IGS (trnG-GCC, trnR-UCU)                          TAATAAGAAATAAGAAAAAAA (×2)             LSC
13   45       T     IGS (trnR-UCU, atpA)                              ATAAAGATATTCTAAATTAATAA (×2)           LSC
14   33       T     IGS (atpF, atpH)                                  TGGAAATTTCCAATAAG (×2)                 LSC
15   39       T     IGS (trnC-GCA, petN)                              TTCTAATACATCTAATTAAA (×2)              LSC
16   39       T     IGS (trnT-GGU, psbD)                              GTAATAAAGTAATAAAAAAA (×2)              LSC
17   [illegible]  T  [illegible]                                      [illegible]                            [illegible]
18   36       T     IGS (trnT-GGU, psbD)                              AGTAGAAAGTAATAAAAT (×2)                LSC
19   91       T     IGS (psbZ, trnG-UCC)                              TATTAAATATGGATTGTATATATTGTA (×3)       LSC
20   39       T     IGS (trnT-UGU, trnL-UAA)                          AGAACATACCTATTAATATA (×2)              LSC
21   45       T     IGS (trnT-UGU, trnL-UAA)                          TTTTTTTGTTATGTTATAATGTT (×2)           LSC
22   31       T     IGS (ndhJ, ndhK)                                  TTTGTTATTCTGTACA (×2)                  LSC
23   36       T     IGS (trnV-UAC, trnM-CAU)                          TTTGATTGGTATTGCTTA (×2)                LSC
24   45       T     IGS ([illegible])                                 TAGAAACATGTAGAAAGATGAAT (×2)           LSC
25   39       T     IGS (accD, psaI)                                  AATTAATATATATTTCTTA (×2)               LSC
26   30       T     IGS (petA, psbJ)                                  TTTATTAGATTAAATA (×2)                  LSC
27   44       T     IGS (psbE, petL)                                  TTTCAATTGAATTTATCC (×2)                LSC
28   37       T     IGS (rpl33, rps18)                                TAAATAGAAATAAATATAA (×2)               LSC
29   45       T     clpP (intron)                                     AAATATCAAATAAATTAAATATA (×2)           LSC
30   38       T     IGS (ndhE, ndhG)                                  AGATTCAATTGACTAGAAT (×2)               SSC

F: Forward; I: Inverted; T: Tandem; IGS: Intergenic spacer; CDS: Coding DNA Sequence